Who am I

I am a PhD student in linguistics with some previous experience in R and RStudio as well as basic statistics, but with this course I hope to gain more understanding in statistical methods and open data that I could use in my research. My GitHub repository.


Regression and model validation

For this week's tasks I wrangled data into data frames in the form required for analysis. I then used the data to analyse the relationship between a dependent and an independent variable by fitting a linear regression model, and plotted the residuals to diagnose the appropriateness of the model.


Introduction

In this analysis we investigate how the points scored in a statistics exam by a student are explained by different learning approaches and the student’s attitude, age and gender. To do this, we shall analyse the explanatory power of the different potential explanatory variables by building a regression model where exam points is the dependent variable.

Data overview

The data for this analysis is a survey dataset on learning approaches in a statistics class. More information on its collection can be found here. The data has 166 observations and 7 variables. Below is a sample of what the dataset used for the purposes of this study looks like.

##   gender Age Attitude     deep  stra     surf Points
## 1      F  53       37 3.583333 3.375 2.583333     25
## 2      M  55       31 2.916667 2.750 3.166667     12
## 3      F  49       25 3.500000 3.625 2.250000     24
## 4      M  53       35 3.500000 3.125 2.250000     10
## 5      M  49       37 3.666667 3.625 2.833333     22

As can be seen, observations include the gender and age of the respondents as well as their attitude towards statistics and exam points. It should be noted that respondents whose score in Points equals 0 have been removed from the current 166-observation dataset. The scores for the deep, strategic (stra) and surface (surf) approaches to learning are means of the individual question scores: for example, deep represents the mean of the scores for the following variables.

deep <- c("D03", "D11", "D19", "D27", "D07", "D14", "D22", "D30","D06",  "D15", "D23", "D31")
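The averaging step itself can be sketched with rowMeans(); the toy data frame below stands in for the real survey data, using only three of the question columns listed above:

```r
# Toy illustration of the averaging step: 'deep' as the row mean of the
# individual question scores (only three of the twelve D* columns shown)
lrn <- data.frame(D03 = c(4, 2), D11 = c(3, 3), D19 = c(5, 1))
lrn$deep <- rowMeans(lrn[, c("D03", "D11", "D19")])
lrn$deep  # 4 and 2
```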

Correlation analysis

Graphical overview

The following offers an overview visualisation of correlations within the data. It is, for example, noteworthy that there are twice as many women as men. In terms of correlations, the strongest (0.437) is that between Points and Attitude, noticeable among both genders. The lowest (absolute) correlation is between Attitude and Age. This suggests that attitude towards statistics learning does not depend on the age of the respondent, but attitude does play an important role in the points scored by the respondent. Finally, although low, two more correlations worth noting are the 0.146 correlation between Points and the strategic approach, and the -0.144 correlation between Points and the surface approach.

In addition, the correlation between the surface and deep approaches, although minimal among women, is -0.622 among men, suggesting that men who use the surface approach are less likely to use the deep approach and vice versa. As seen in the figure, there is also some negative correlation between the surface approach and Attitude, again more among men, as well as between the surface and strategic approaches.

Regression model

Based on the overview above, we shall first investigate the potential of a linear model that assumes Points is explained by the variables Attitude, surf and stra. The results of this model are given below:

## 
## Call:
## lm(formula = Points ~ Attitude + surf + stra, data = dt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## Attitude     0.33952    0.05741   5.913 1.93e-08 ***
## surf        -0.58607    0.80138  -0.731  0.46563    
## stra         0.85313    0.54159   1.575  0.11716    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

According to the above results, the only variable with a p-value < 0.001 is Attitude; the p-values of the approach variables are > 0.05, meaning that for those variables we fail to reject the null hypothesis that their coefficients are zero. Thus, below we shall fit a new model with only the statistically significant variable Attitude.

Relationship between explanatory and target variable

The results above suggested that only Attitude can be used to explain the dependent variable Points. In other words, our model is:

Points = α + βAttitude + ε

where

  • α is the intercept, i.e. the expected Points when Attitude = 0

  • β is the regression coefficient of Attitude

  • ε is an unobserved, normally distributed random error term

The results and a plot illustrating this model are given below:

## 
## Call:
## lm(formula = Points ~ Attitude, data = dt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The line is positioned so as to minimise the residuals (differences between observed and predicted values), but quite a few observations fall far from the line. Indeed, the multiple R-squared is 0.1906, indicating that only about 19 % of the variance in Points is explained by the variable Attitude. However, the p-value < 0.001 gives reason to believe the variable is nonetheless statistically significant.
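As a sanity check, the fitted coefficients from the summary above can be used to predict Points by hand:

```r
# Hand-rolled prediction from the fitted coefficients
# (intercept and slope copied from the summary above)
alpha <- 11.63715
beta  <- 0.35255
predict_points <- function(attitude) alpha + beta * attitude
predict_points(30)  # about 22.2 points for an attitude score of 30
```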

Diagnostic plots

The current model assumes that Points can be explained by a single explanatory variable, Attitude, and that a linear regression model (as opposed to e.g. a polynomial regression model) is sufficient for explaining and predicting the observations. To diagnose the model, we shall plot the residuals. Plotting the residuals against the fitted values, we can see that the model is reasonable.

In the Residuals vs Fitted and Residuals vs Leverage plots, the observations are distributed seemingly evenly and there is no visible pattern that would suggest the need for a polynomial regression model. In the Normal Q-Q plot, most of the observations line up on the same line, with some deviation only among the lowest and the highest values. This indicates the model is adequate for explaining our dependent variable and that we do not need to resort to a more complex model.
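The diagnostic plots discussed above come from calling plot() on the fitted lm object; the which argument selects Residuals vs Fitted (1), Normal Q-Q (2) and Residuals vs Leverage (5). A minimal sketch, with simulated data standing in for the real data frame:

```r
# Simulated stand-in for the real data; only the plotting call matters here
set.seed(1)
dt <- data.frame(Attitude = rnorm(100, mean = 31, sd = 7))
dt$Points <- 11.6 + 0.35 * dt$Attitude + rnorm(100, sd = 5)
model2 <- lm(Points ~ Attitude, data = dt)
plot(model2, which = c(1, 2, 5))  # the three diagnostic plots
```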


Logistic regression

In this section I use a logistic regression model to study what variables impact alcohol consumption.

Data overview

The data for the present study is a dataset on alcohol consumption among secondary school students in two schools in Portugal. More information (and a download) is available here. The variables include e.g. the student’s background (family size, education and employment of parents), behaviour (e.g. amount of free time, health status) as well as performance in the subjects of Maths and Portuguese. In the dataset of this study, the original data has already been modified so that the variables for alcohol consumption on weekdays and on weekends have been combined into a single variable alc_use, which is the mean of the two. Further, a binary column high_use has been created based on alc_use, where high_use = TRUE signifies a mean alcohol consumption greater than 2.
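The derivation of alc_use and high_use described above can be sketched as follows (column names Dalc and Walc as in the original data; toy values for illustration):

```r
# Combine weekday (Dalc) and weekend (Walc) consumption into alc_use,
# then flag high consumption (mean greater than 2)
alc <- data.frame(Dalc = c(1, 3, 5), Walc = c(2, 4, 5))
alc$alc_use  <- (alc$Dalc + alc$Walc) / 2
alc$high_use <- alc$alc_use > 2
alc$high_use  # FALSE TRUE TRUE
```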

A list of all variables is shown below:

##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Several of the variables are numeric, using a 1-5 Likert-type scale (very low - very high), including the values for alcohol consumption. Other variables are binary yes/no data, such as whether the student is in a romantic relationship.

Variables impacting alcohol consumption

Hypotheses

The purpose of this study is to explore how variables predict high/low alcohol consumption among the observations. I hypothesise that alcohol consumption is related to the following four variables:

  • Sex: it is expected that male students have a higher consumption rate than female students.
  • Study time: students who study longer per week are expected to have a lower alcohol consumption (negative correlation).
  • Going out with friends: students who go out will have a higher alcohol consumption rate (positive correlation).
  • Absences: students with many absences from school are expected to have higher alcohol consumption (positive correlation).

The hypotheses above reflect only my current intuitions and are not based on research literature on alcohol consumption or on previous familiarity with the data. It should also be noted that the purpose of this analysis is to study relationships: even a strong relationship is not necessarily a sign of causality, let alone of which variable is the cause and which the effect.

Distribution of chosen variables

1. Sex

The first hypothesis is that male students have higher alcohol consumption than female students. It should be noted that the number of females (198) is around the same as the number of males (184). Below, the numbers of high-consumption students are given as proportions.

The figure shows that our hypothesis is supported: male students have a higher percentage of high alcohol consumption, i.e. there are proportionally more men with high consumption rates than women.

2. Study time

Our second hypothesis is that students who study more are less likely to have a high alcohol consumption rate (negative correlation). Study time is measured on the following 4-point scale of weekly study hours: 1 = <2 hours, 2 = 2-5 hours, 3 = 5-10 hours, 4 = >10 hours. In the figure below, the findings are again split across sexes. A boxplot was chosen because, in spite of the values being numbers, Likert-type data is better thought of as ordinal rather than continuous, so the medians and quartiles shown in a boxplot are more informative than averages, which are misleading at best.

The graph indicates that high consumption is indeed more common among students with lower weekly study times. Interestingly, male students show a break in the pattern: those who study 5-10 hours per week have a noticeably lower number of high consumers than all other groups apart from females studying more than 10 hours. Nevertheless, our overall hypothesis is still supported.

3. Going out with friends

Here it is hypothesised that students who go out are more likely to have high consumption rates. The frequency of going out with friends is here measured using a numeric 5-step Likert-scale, where 1 = very low, and 5 = very high. Again, we examine this first in boxplots.

It appears that high consumption is more common among students who go out frequently, but this is especially pronounced among male students, whereas the female students who go out the most have a lower number of high consumers than the second-highest female group. Thus, our hypothesis is applicable mainly for male students.

4. Absences

Another assumption is that students with high alcohol consumption are also more likely to have more absences from school. Absences is a numeric variable representing the number of absences. We shall explore it in a boxplot, where the dotted line indicates the mean.

The means indicate a difference, but this may be due to outliers. Not counting the outliers marked as dots beyond the whiskers, the boxplots above suggest that there may be a relationship between absences and alcohol consumption among male students, but among female students the differences are quite small, the medians being about even and the main differences visible in the top 50 %. Based on this alone, we cannot outright claim that our hypothesis is supported.

Fitting a logistic regression model

In this section we shall fit a logistic regression model that could explain and predict high alcohol use among the population. We shall at least not yet remove any of the variables, and our model and its statistical summary is thus as follows.

high_use ~ sex + studytime + goout + absences

## 
## Call:
## glm(formula = high_use ~ sex + studytime + goout + absences, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9103  -0.7892  -0.5021   0.7574   2.6021  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.20483    0.59429  -5.393 6.94e-08 ***
## sexM         0.78173    0.26555   2.944 0.003242 ** 
## studytime   -0.42116    0.17058  -2.469 0.013551 *  
## goout        0.72677    0.11994   6.059 1.37e-09 ***
## absences     0.07811    0.02244   3.481 0.000499 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 381.31  on 377  degrees of freedom
## AIC: 391.31
## 
## Number of Fisher Scoring iterations: 4

Based on the above, our highly statistically significant (p < 0.001) variables are goout and absences, with the variable sex having a statistically significant p-value of < 0.01, and studytime a statistically significant p-value of < 0.05. In other words, we reject the null hypothesis for each of our variables.

The coefficients of the variables are as follows:

## (Intercept)        sexM   studytime       goout    absences 
## -3.20483157  0.78173290 -0.42116184  0.72676824  0.07810746

As hypothesised, study time is the only explanatory variable with a negative coefficient. Absences has the lowest absolute coefficient. We shall take the exponents of the coefficients to analyse the odds ratios of the variables as well as their confidence intervals.

Odds are calculated as p/(1-p), where p is a probability. The odds ratio is the ratio of the odds of two values, thus quantifying their relationship; it also represents the multiplicative change in the odds when the variable increases by one unit. The confidence interval describes the precision of the odds ratio estimate: the 2.5 % and 97.5 % quantiles below bound a 95 % confidence interval, i.e. a range that covers the true population value with 95 % confidence.

##              OddsRatio      2.5 %    97.5 %
## (Intercept) 0.04056573 0.01222314 0.1263579
## sexM        2.18525582 1.30370562 3.7010763
## studytime   0.65628388 0.46510383 0.9099505
## goout       2.06838526 1.64470314 2.6349716
## absences    1.08123884 1.03577383 1.1324143

The odds ratios show the change in the odds of high alcohol consumption for every one-unit increase in study time, going out or absences. For sex, the odds of a man having high alcohol consumption are about 2.19 times those of a woman. Ranked from highest to lowest odds ratio, i.e. the largest change in odds per unit, sex is at the top and study time at the bottom. However, the confidence intervals are fairly wide, meaning that the population-level effect sizes remain rather uncertain.
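The odds-ratio computation itself is a one-liner on the fitted glm object. The sketch below is self-contained, using simulated data in place of the real alc data frame and Wald confidence intervals via confint.default():

```r
# Simulated stand-in for the real data frame; two predictors for brevity
set.seed(42)
n <- 300
sim <- data.frame(goout    = sample(1:5, n, replace = TRUE),
                  absences = rpois(n, 4))
p <- plogis(-3 + 0.7 * sim$goout + 0.08 * sim$absences)  # true model
sim$high_use <- runif(n) < p
m <- glm(high_use ~ goout + absences, data = sim, family = "binomial")
# Exponentiating coefficients and confidence limits gives odds ratios
exp(cbind(OddsRatio = coef(m), confint.default(m)))
```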

Predictive accuracy

Finally, we shall evaluate the prediction accuracy of our logistic regression model. That is, if the model was used to predict each observation, with what frequency would it be able to correctly categorize an observation as having either high or low alcohol consumption rate. In this case the accuracy of our model is as follows (in n of observations and proportions):

##         prediction
## high_use FALSE TRUE
##    FALSE   252   16
##    TRUE     65   49
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65968586 0.04188482 0.70157068
##    TRUE  0.17015707 0.12827225 0.29842932
##    Sum   0.82984293 0.17015707 1.00000000

The so-called confusion matrix above (with rounded proportions) indicates that 66.0 % (N = 252) of the observations have a low consumption rate and are predicted as such by the model. Correspondingly, 12.8 % (N = 49) of the observations have high consumption and are accurately predicted as such. The remainder were predicted incorrectly (high use predicted as low, or vice versa).
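The error rate behind these proportions is simply the share of observations where the prediction disagrees with the truth; a minimal sketch:

```r
# Misclassification rate: proportion of prediction/truth mismatches
truth      <- c(TRUE, TRUE, FALSE, FALSE, FALSE)
prediction <- c(TRUE, FALSE, FALSE, FALSE, TRUE)
mean(truth != prediction)  # 0.4, i.e. 2 of 5 predicted incorrectly
```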

In short, approx. 78.8 % of the observations were predicted correctly and 21.2 % (0.2120419) were predicted incorrectly. Visualised, the data appears as follows:

Had we not fitted a logistic regression model but simply guessed, the number of correct guesses could be different from (ideally lower than) the number of correct predictions by the model. If we assumed none of the students to have a high alcohol consumption rate (probability = 0), our error rate would be the following:

## [1] 0.2984293

We shall also test the predictive power of a 50/50 guessing method, where the probability of a student having a high alcohol consumption rate is assumed to be 0.5 (i.e. the rate is as likely to be high as low). We calculate the mean of incorrect guesses as:

## [1] 0

Interestingly, our calculation indicates that the mean of incorrect guesses is 0, i.e. that all guesses are correct. Since a fair-coin guess should be wrong roughly half the time, this suggests an error in the calculation, but investigating it further falls outside the scope of this analysis.

Bonus

10-fold validation

Lastly, we shall use so-called 10-fold cross-validation to assess the accuracy of the model fitted above. The main goal of cross-validation is to validate the model on data that has not been used for training it. In 10-fold cross-validation, the data is split into 10 parts, of which 9 are used for training and the remaining 1 for testing; the process is repeated ten times, each part serving as the test set once. The mean prediction error is as follows:

## [1] 0.2120419

In summary, the prediction error of the model on test data (0.2120419) is essentially identical to the training error of our model. The prediction error is also lower than that of a similar model tested in the DataCamp exercises leading up to this analysis, where the explanatory variables were sex, absences and failures in classes.
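The cross-validation step can be reproduced with cv.glm() from the boot package; the sketch below is self-contained, with simulated data standing in for the real model and data frame:

```r
library(boot)  # provides cv.glm()

set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- as.integer(runif(200) < plogis(d$x))      # simulated binary outcome
fit <- glm(y ~ x, data = d, family = "binomial")

# Cost = proportion of misclassifications at a 0.5 probability threshold
cost <- function(obs, prob) mean(abs(obs - prob) > 0.5)
cv <- cv.glm(data = d, glmfit = fit, cost = cost, K = 10)
cv$delta[1]  # mean 10-fold prediction error (between 0 and 1)
```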


Clustering and classification

In this analysis, we apply clustering and classification methods to study a dataset. Based on a model fitted to training data, we are also able to classify observations in test data.

Introduction


Data description

The data we shall use for this analysis comes from the R package “MASS” and represents housing values in Boston in the 1970s. A quick overview of the variables is provided below, where the variables stand for the following (Source):

  • crim: per capita crime rate by town.

  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.

  • indus: proportion of non-retail business acres per town.

  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

  • nox: nitrogen oxides concentration (parts per 10 million).

  • rm: average number of rooms per dwelling.

  • age: proportion of owner-occupied units built prior to 1940.

  • dis: weighted mean of distances to five Boston employment centres.

  • rad: index of accessibility to radial highways.

  • tax: full-value property-tax rate per $10,000.

  • ptratio: pupil-teacher ratio by town.

  • black: 1000(Bk−0.63)^2 where Bk is the proportion of blacks by town.

  • lstat: lower status of the population (percent).

  • medv: median value of owner-occupied homes in $1000s.

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The variable rad and the dummy variable chas are the only integer variables, the remaining being numerical.

Graphical overview

Scaling the data

Before we begin our analysis, we shall standardise the dataset: the numeric values of the variables are adjusted so that each variable has mean 0, with all values indicating the distance from the mean in units of standard deviation. The data now looks as follows (cf. the table above):

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
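The standardisation is done column-wise by scale(); a tiny worked example:

```r
# scale() subtracts the mean and divides by the standard deviation
x <- c(2, 4, 6)       # mean 4, sd 2
as.vector(scale(x))   # -1 0 1
```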

Before continuing, we shall modify the variable crim into a categorical factor variable based on its quantiles, seen above as its values for Min, 1st Qu., Median, 3rd Qu., and Max. The variable (renamed crime) now has the following factors and observations per factor:

## 
##      low  med_low med_high     high 
##      127      126      126      127

We shall also divide the dataset into a testing and training set. The training set will consist of a sample of 80 % of the whole dataset, the testing set the remaining 20 %.

##        zn                indus               chas         
##  Min.   :-0.487240   Min.   :-1.51549   Min.   :-0.27233  
##  1st Qu.:-0.487240   1st Qu.:-0.90400   1st Qu.:-0.27233  
##  Median :-0.487240   Median :-0.43683   Median :-0.27233  
##  Mean   :-0.008025   Mean   :-0.06294   Mean   :-0.07933  
##  3rd Qu.: 0.048724   3rd Qu.: 1.01499   3rd Qu.:-0.27233  
##  Max.   : 3.371702   Max.   : 2.42017   Max.   : 3.66477  
##       nox                 rm                age          
##  Min.   :-1.35225   Min.   :-3.44659   Min.   :-2.20168  
##  1st Qu.:-0.78268   1st Qu.:-0.52075   1st Qu.:-0.70784  
##  Median :-0.14407   Median : 0.08165   Median : 0.49292  
##  Mean   : 0.05863   Mean   : 0.06554   Mean   : 0.06654  
##  3rd Qu.: 0.95838   3rd Qu.: 0.43604   3rd Qu.: 0.96097  
##  Max.   : 2.72965   Max.   : 3.55153   Max.   : 1.11639  
##       dis                rad               tax          
##  Min.   :-1.24464   Min.   :-0.9819   Min.   :-1.27709  
##  1st Qu.:-0.82515   1st Qu.:-0.6373   1st Qu.:-0.77572  
##  Median :-0.34430   Median :-0.5225   Median :-0.51761  
##  Mean   :-0.06925   Mean   : 0.1114   Mean   : 0.08661  
##  3rd Qu.: 0.42145   3rd Qu.: 1.6596   3rd Qu.: 1.52941  
##  Max.   : 3.28405   Max.   : 1.6596   Max.   : 1.79642  
##     ptratio             black              lstat         
##  Min.   :-2.70470   Min.   :-3.86850   Min.   :-1.32936  
##  1st Qu.:-0.30279   1st Qu.: 0.21829   1st Qu.:-0.91696  
##  Median : 0.34387   Median : 0.39751   Median :-0.24269  
##  Mean   : 0.05767   Mean   :-0.03202   Mean   : 0.02231  
##  3rd Qu.: 0.80578   3rd Qu.: 0.42736   3rd Qu.: 0.68329  
##  Max.   : 1.26768   Max.   : 0.44062   Max.   : 3.09715  
##       medv               crime   
##  Min.   :-1.90634   low     :24  
##  1st Qu.:-0.77283   med_low :22  
##  Median :-0.06881   med_high:25  
##  Mean   :-0.04610   high    :31  
##  3rd Qu.: 0.23564                
##  Max.   : 2.98650
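The 80/20 split described above can be sketched as follows (rebuilding the scaled data from the MASS package; the crime factor is omitted for brevity):

```r
library(MASS)  # provides the Boston data
boston_scaled <- as.data.frame(scale(Boston))
set.seed(123)
n   <- nrow(boston_scaled)               # 506 towns
ind <- sample(n, size = floor(0.8 * n))  # row indices for the training set
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
c(nrow(train), nrow(test))               # 404 and 102
```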

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is a classification method that assumes continuous explanatory variables that are normally distributed within each class, such as those in our scaled dataset. Using crime as our target variable, we shall fit an LDA model on the training set so as to later test it on the testing set.

## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2574257 0.2500000 0.2376238 
## 
## Group means:
##                   zn      indus        chas        nox            rm
## low       0.93561469 -0.8791167 -0.15765625 -0.8584288  0.4030790815
## med_low  -0.07990739 -0.2511715  0.03052480 -0.5663620 -0.1441037597
## med_high -0.40063686  0.2518350  0.19544522  0.4040450  0.0007740068
## high     -0.48724019  1.0172418  0.01475116  1.0471957 -0.3468119961
##                 age        dis        rad        tax     ptratio
## low      -0.8749995  0.8635262 -0.6652065 -0.7384750 -0.43150018
## med_low  -0.3251565  0.3539289 -0.5456746 -0.4655254 -0.06962111
## med_high  0.3980728 -0.3729760 -0.4281054 -0.2932605 -0.28724521
## high      0.8015457 -0.8439374  1.6368728  1.5131579  0.77931510
##                black        lstat         medv
## low       0.37854560 -0.740236300  0.493529082
## med_low   0.31106105 -0.087035349 -0.009839905
## med_high  0.07292471  0.009983557  0.135844053
## high     -0.78582905  0.854296323 -0.612794122
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.09910915  0.737784065 -0.86556739
## indus    0.05813063 -0.226754912  0.24539648
## chas    -0.01303732 -0.004699865  0.18182729
## nox      0.37458161 -0.765119575 -1.39662404
## rm       0.05375753 -0.033873818 -0.07454071
## age      0.27193353 -0.326037635 -0.10504199
## dis     -0.04080143 -0.349824752  0.18558372
## rad      3.40146992  1.093227307 -0.16831228
## tax      0.04155715 -0.242355330  0.66426756
## ptratio  0.12926685  0.016984723 -0.22768789
## black   -0.08724479  0.039693546  0.11053220
## lstat    0.23390129 -0.184184755  0.54859529
## medv     0.11103755 -0.481229966 -0.15722868
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9532 0.0348 0.0120

Based on the above, the first linear discriminant (LD1) explains as much as 95.3 % of the between-group variance, LD2 explaining 3.5 % and LD3 1.2 %. Visualising the observations across LD1 and LD2, the data looks as follows:

Predicting classes in test data

Based on the classification data of the LDA above, we shall classify the observations of the test data, i.e. we are testing the predictive accuracy of the LDA on data not part of the model’s training data. The results of the prediction are as follows:

##           predicted
## correct    low med_low med_high high
##   low       18       4        2    0
##   med_low    3      16        3    0
##   med_high   1       9       13    2
##   high       0       0        0   31

Out of a total of 102 observations, the number of incorrectly classified observations is 24 (= 4 + 2 + 3 + 3 + 1 + 9 + 2), i.e. 23.5 %.
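The whole LDA workflow, from the quantile-cut crime factor to the test-set confusion matrix, can be sketched end-to-end as follows (a self-contained reconstruction, so the random split differs from the one above):

```r
library(MASS)  # lda() and the Boston data

boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL  # the factor replaces the continuous variable

set.seed(123)
ind   <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]

lda.fit <- lda(crime ~ ., data = train)    # fit on the training set
pred <- predict(lda.fit, newdata = test)   # classify the test set
table(correct = test$crime, predicted = pred$class)
```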

Distances

Here we shall investigate the distances between observations. To do this, we begin with a fresh dataset, i.e. the original Boston dataset prior to the modifications described above. We standardise the dataset and compute the Euclidean and Manhattan distance matrices:

data('Boston')
boston_scaled <- scale(Boston)
dist_eu <- dist(boston_scaled)                         # Euclidean distances
dist_man <- dist(boston_scaled, method = "manhattan")  # Manhattan distances
## [1] "Euclidean distance"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.119  85.620 170.500 226.300 371.900 626.000
## Warning in dist(boston_scaled, method = "manhattan"): NAs introduced by
## coercion
## [1] "Manhattan distance"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2849  8.8390 13.0300 13.8700 18.1000 46.8900

Let us visualise the k-means clustering of the dataset. Here the optimal number of clusters is determined by looking at the total within-cluster sum of squares (WCSS): the optimal number is where the WCSS line in a line plot drops most rapidly.
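The WCSS search and the final clustering can be sketched as follows (k-means on the freshly scaled Boston data; nstart is an added stabiliser, not necessarily part of the original run):

```r
library(MASS)  # Boston data
boston_scaled <- scale(Boston)

set.seed(123)
# Total within-cluster sum of squares for k = 1..10
wcss <- sapply(1:10, function(k)
  kmeans(boston_scaled, centers = k, nstart = 10)$tot.withinss)
# plot(1:10, wcss, type = "b")  # look for the sharpest drop ("elbow")

km <- kmeans(boston_scaled, centers = 2, nstart = 10)  # final clustering
table(km$cluster)  # cluster sizes
```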

Based on the above, we select 2 as the optimal number of clusters. Visualised, the clusters appear as follows:

Super-bonus

We shall create a 3D-plot of the data across the three linear dimensions, setting the colour to represent crime quantiles.

## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
## [1] 404  13
## [1] 13  3

Looking at the same model with color signifying the clusters:

## [1] 404  13
## [1] 13  3
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

Here the low and med_low observations merge into cluster 1, whereas high and med_high make up the second cluster. A 3D view illustrates the heterogeneity of the clusters: cluster 1 spans a limited range of y-values (LD2) but a wide range of x-values (LD1), whereas for cluster 2 the opposite is true. Both clusters extend along the z-axis (LD3). However, it should be noted that LD3 had the smallest proportion of trace, and that of LD2 was also quite low. In light of this, the plot seems appropriate enough.


Dimensionality reduction techniques

Because analysing and visualising tens of dimensions is difficult, it is often useful to reduce the number of dimensions by merging correlated variables. In this analysis we conduct a Principal Component Analysis and a Multiple Correspondence Analysis.

Data overview

The data we use here was created as part of the UN’s Human Development Programme, reflecting the Human Development Index (HDI) and Gross National Income, along with variables such as expected years of education or the percentage of women in the labour force in different countries. For an overview, the variables are:

## 'data.frame':    155 obs. of  8 variables:
##  $ eduratio: num  1.007 0.997 0.983 0.989 0.969 ...
##  $ labratio: num  0.891 0.819 0.825 0.884 0.829 ...
##  $ exped   : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ lifexp  : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ gni     : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ matmor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ adbirth : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ parlperc: num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

i.e.

  • eduratio: the ratio of the percentage of females vs. males with at least secondary education
  • labratio: the ratio of the percentages of females vs. males in the labour force
  • exped: expected years of education
  • lifexp: life expectancy at birth
  • gni: Gross National Income per capita
  • matmor: maternal mortality ratio
  • adbirth: adolescent birth rate
  • parlperc: percentage of female representatives in parliament

Note that any countries with missing values for any variables have been deleted from the present data.

Principal Component Analysis with Non-Standardized Data

First we shall conduct a Principal Component Analysis (PCA) using the same data as above with non-standardized variables.

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

As seen in the biplot visualisation above, GNI clearly stands out from the other variables. However, this is because the variable is on a completely different numeric scale than the others: tens of thousands versus ratios below 1 or below 100. This makes the correlations unreliable (only outliers show) and difficult to analyse. The data must therefore be scaled.
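The effect can be demonstrated on any dataset where one variable's variance dwarfs the rest; here the built-in USArrests data stands in for the human data, with Assault playing the role of GNI:

```r
# variances on wildly different scales: Assault dominates
sapply(USArrests, var)

# PCA without standardisation
pca_raw <- prcomp(USArrests, scale. = FALSE)

round(pca_raw$rotation[, 1], 3)  # PC1 loads almost entirely on Assault
round(100 * pca_raw$sdev^2 / sum(pca_raw$sdev^2), 1)  # PC1 "explains" nearly everything
```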

Principal Component Analysis with Standardized Data

To remedy the problem described above, we conduct a PCA on a standardized version of the data, where each variable has mean 0 and standard deviation 1. The results are visualised using PC1 and PC2, along with the percentage of variance explained by each principal component. Together the first two components explain 69.8 % of the variance.

##     eduratio          labratio           exped             lifexp       
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       gni              matmor           adbirth           parlperc      
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
## [1] "PC1 (53.6%)" "PC2 (16.2%)" "PC3 (9.6%)"  "PC4 (7.6%)"  "PC5 (5.5%)" 
## [6] "PC6 (3.6%)"  "PC7 (2.6%)"  "PC8 (1.3%)"

From this biplot we can see that expected years of education correlates, as one might expect, with the ratio of females vs. males with secondary education, as well as with gni and life expectancy at birth. Further, these four variables correlate negatively with maternal mortality and adolescent birth rate. We also see that the ratio of females vs. males in the labour force correlates with the percentage of female representatives in parliament.

Interpretations

I interpret the two principal components identified above as follows:

  • PC1: life and welfare
  • PC2: representation

Along PC1 (53.6 % of variation) we have variables that represent life and mortality, but also education and gni. Meanwhile, PC2 (16.2 % of variation) represents two variables tied to the representation of women in working life and politics. Female representation alone does not guarantee low maternal mortality or adolescent birth rate (cf. Mozambique in the biplot above), but high education levels or gni also do not guarantee equal representation (cf. Iran or Qatar).

Multiple Correspondence Analysis

Here we shall switch datasets and use data from the FactoMineR package describing tea consumption and how tea is usually enjoyed. We shall analyse the data through Multiple Correspondence Analysis (MCA). The dataset has 36 variables in total, but here we restrict our attention to who drinks tea, how often and in what form. Our variables are “SPC”, “frequency”, “how”, “How” and “sex”. The results appear as follows:

## Warning: attributes are not identical across measure variables; they will
## be dropped

## 
## Call:
## MCA(X = tea2) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.293   0.263   0.248   0.243   0.233   0.221
## % of var.              9.772   8.766   8.268   8.085   7.774   7.357
## Cumulative % of var.   9.772  18.537  26.806  34.891  42.665  50.021
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.208   0.197   0.191   0.183   0.166   0.159
## % of var.              6.936   6.553   6.382   6.103   5.517   5.306
## Cumulative % of var.  56.957  63.509  69.891  75.994  81.511  86.817
##                       Dim.13  Dim.14  Dim.15
## Variance               0.139   0.129   0.128
## % of var.              4.618   4.312   4.253
## Cumulative % of var.  91.435  95.747 100.000
## 
## Individuals (the 10 first)
##                 Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1            |  0.683  0.530  0.204 |  0.325  0.133  0.046 | -0.369  0.183
## 2            |  0.480  0.262  0.083 | -0.147  0.027  0.008 | -0.360  0.174
## 3            | -0.240  0.065  0.017 | -0.551  0.385  0.087 | -0.611  0.501
## 4            |  0.231  0.061  0.033 | -0.259  0.085  0.041 | -0.370  0.184
## 5            | -0.033  0.001  0.001 | -0.047  0.003  0.001 |  0.060  0.005
## 6            |  0.231  0.061  0.033 | -0.259  0.085  0.041 | -0.370  0.184
## 7            |  0.665  0.502  0.122 |  0.526  0.351  0.076 |  0.502  0.339
## 8            |  0.381  0.165  0.041 |  0.080  0.008  0.002 | -0.739  0.733
## 9            |  0.712  0.576  0.155 | -0.159  0.032  0.008 |  1.220  2.002
## 10           |  0.359  0.147  0.049 |  0.183  0.042  0.013 |  1.131  1.720
##                cos2  
## 1             0.060 |
## 2             0.047 |
## 3             0.107 |
## 4             0.084 |
## 5             0.002 |
## 6             0.084 |
## 7             0.069 |
## 8             0.156 |
## 9             0.456 |
## 10            0.488 |
## 
## Categories (the 10 first)
##                 Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2 v.test  
## employee     | -0.325  1.421  0.026 -2.785 | -0.196  0.572  0.009 -1.673 |
## middle       |  0.472  2.029  0.034  3.203 |  1.314 17.501  0.266  8.910 |
## non-worker   | -0.254  0.942  0.018 -2.291 | -0.038  0.023  0.000 -0.341 |
## other worker |  0.618  1.740  0.027  2.858 | -1.156  6.772  0.095 -5.341 |
## senior       |  1.211 11.680  0.194  7.612 |  0.080  0.057  0.001  0.506 |
## student      | -0.751  8.985  0.172 -7.167 | -0.182  0.585  0.010 -1.731 |
## workman      |  1.201  3.939  0.060  4.240 | -0.465  0.659  0.009 -1.643 |
## 1/day        |  0.611  8.075  0.173  7.196 | -0.520  6.515  0.125 -6.122 |
## 1 to 2/week  |  0.344  1.184  0.020  2.466 |  0.064  0.045  0.001  0.457 |
## +2/day       | -0.529  8.080  0.205 -7.837 |  0.038  0.046  0.001  0.558 |
##               Dim.3    ctr   cos2 v.test  
## employee      0.375  2.228  0.034  3.207 |
## middle       -0.369  1.466  0.021 -2.504 |
## non-worker   -0.188  0.609  0.010 -1.694 |
## other worker -1.094  6.428  0.085 -5.054 |
## senior        1.486 20.786  0.292  9.341 |
## student      -0.374  2.626  0.042 -3.564 |
## workman       0.058  0.011  0.000  0.204 |
## 1/day        -0.034  0.029  0.001 -0.395 |
## 1 to 2/week  -0.976 11.275  0.164 -7.000 |
## +2/day        0.289  2.845  0.061  4.278 |
## 
## Categorical variables (eta2)
##                Dim.1 Dim.2 Dim.3  
## SPC          | 0.451 0.344 0.424 |
## frequency    | 0.258 0.258 0.184 |
## how          | 0.063 0.547 0.521 |
## How          | 0.149 0.139 0.102 |
## sex          | 0.546 0.027 0.010 |
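As a sanity check on the 15 dimensions in the eigenvalue table: in MCA the number of dimensions equals the total number of category levels minus the number of variables. The level counts below are my assumptions about the tea data (e.g. the seven socio-professional categories listed above), not values printed by MCA():

```r
# assumed level counts for the five variables in the model
n_levels <- c(SPC = 7, frequency = 4, how = 3, How = 4, sex = 2)

# MCA dimensionality: total levels minus number of variables
sum(n_levels) - length(n_levels)  # 15, matching the eigenvalue table
```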

We need as many as 6 dimensions to explain even 50 % of the variance. However, it appears, for example, that the middle class favours unpackaged (loose-leaf) tea more than other social groups. Further, they do not drink tea with milk as much as the working class does. (It is known in sociology that the middle class tends to distinguish itself from the lower classes, while its practices still differ from those of the upper classes.) A concern about the data is that student and female seem to correlate strongly, whereas senior correlates with male. This could indicate a sample where one sex is overrepresented within a group. However, investigating the data further falls outside the scope of the current analysis.